
Drop AutoTuner recommendation for concurrentGpuTasks for apps using plugin >= 25.06#2090

Open
parthosa wants to merge 2 commits into NVIDIA:dev from parthosa:rapids-tools-2089

Conversation

@parthosa

Fixes #2089

Changes

Starting with plugin 25.06, the RAPIDS plugin auto-tunes the number of concurrent GPU tasks based on memory usage (NVIDIA/spark-rapids#12374). The AutoTuner should no longer recommend spark.rapids.sql.concurrentGpuTasks for apps using that plugin version or later.

Logic (in AutoTuner.calculateClusterLevelRecommendations):

  • Extract the unique plugin jar version from appInfoProvider.getRapidsJars using the existing pluginJarRegEx.
  • If version >= 25.06.0, skip the recommendation by adding the key to skippedRecommendations. This also suppresses the "was not set" missing comment.
  • Target cluster enforced and preserve overrides take precedence: the recommendation is still emitted in those cases.
  • If the plugin jar version cannot be determined (no jars or multiple distinct versions), fall back to the existing recommendation logic.
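
For readers following the steps above, here is a minimal, self-contained sketch of the drop decision. It is not the code in this PR: the object name, the jar-name pattern, and the `userOverridden` flag are assumptions made for illustration, and the real `pluginJarRegEx` and `appInfoProvider.getRapidsJars` may behave differently.

```scala
// Illustrative sketch only; every name here is hypothetical, not the AutoTuner's actual code.
object ConcurrentGpuTasksDropSketch {
  // Assumed jar-name shape, e.g. rapids-4-spark_2.12-25.06.0.jar; the real pluginJarRegEx may differ.
  private val PluginJarPattern = """rapids-4-spark_\d\.\d+-(\d{2}\.\d{2}\.\d+)""".r

  // First plugin version that auto-tunes concurrent GPU tasks (per the PR description).
  private val AutoTuneMinVersion = Seq(25, 6, 0)

  /** Returns the plugin version only when exactly one distinct version is found. */
  def uniquePluginVersion(jars: Seq[String]): Option[String] = {
    val versions = jars.flatMap { jar =>
      PluginJarPattern.findFirstMatchIn(jar).map(_.group(1))
    }.distinct
    if (versions.size == 1) versions.headOption else None
  }

  /** True when the version consists of numeric components and is >= 25.06.0. */
  def isAutoTunedByPlugin(version: String): Boolean = {
    val parts = version.split('.').toSeq.map(p => scala.util.Try(p.toInt).toOption)
    if (parts.exists(_.isEmpty)) {
      false // an unparseable version falls back to the existing recommendation
    } else {
      val firstDiff = parts.flatten.zipAll(AutoTuneMinVersion, 0, 0).collectFirst {
        case (x, y) if x != y => Integer.compare(x, y)
      }
      firstDiff.getOrElse(0) >= 0
    }
  }

  /** User overrides (enforced or preserve) win; otherwise drop only for a single version >= 25.06.0. */
  def shouldDrop(jars: Seq[String], userOverridden: Boolean): Boolean =
    !userOverridden && uniquePluginVersion(jars).exists(isAutoTunedByPlugin)
}
```

In this sketch, `shouldDrop` returns `false` whenever the version is ambiguous or missing, which mirrors the fallback described in the last bullet.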

New helper: Platform.isPropertyUserOverridden(key) — returns true if the property is in enforced or preserve. Lives next to getUserEnforcedSparkProperty / isPropertyPreserved / isPropertyExcluded.
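
A rough sketch of the helper's shape, for orientation only. `PlatformSketch` and its two fields are hypothetical stand-ins for the real `Platform` state, and the `Option[String]` return type of `getUserEnforcedSparkProperty` is an assumption.

```scala
// Hypothetical stand-in for Platform; only the shape of the new helper is illustrated.
final case class PlatformSketch(
    enforced: Map[String, String], // assumed: target-cluster enforced Spark properties
    preserved: Set[String]) {      // assumed: keys the user asked to preserve

  def getUserEnforcedSparkProperty(key: String): Option[String] = enforced.get(key)

  def isPropertyPreserved(key: String): Boolean = preserved.contains(key)

  /** True if the key is either enforced or preserved by the user. */
  def isPropertyUserOverridden(key: String): Boolean =
    getUserEnforcedSparkProperty(key).isDefined || isPropertyPreserved(key)
}
```

With this shape, a single call answers whether either override applies to `spark.rapids.sql.concurrentGpuTasks`, so the drop logic does not need to check the two sources separately.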

Cleanup: dropped an unnecessary .toInt on calcGpuConcTasks(); appendRecommendation already has a Long overload.

Testing

ProfilingAutoTunerSuite — 4 new tests:

  • Drops spark.rapids.sql.concurrentGpuTasks for plugin 25.06.0
  • Keeps recommendation for plugin 25.04.0
  • Keeps recommendation when no plugin jar found
  • Target cluster enforced value wins over the drop logic

Full core test suite (669 tests across 29 suites) passes.

@parthosa parthosa self-assigned this Apr 30, 2026
@github-actions github-actions Bot added the core_tools Scope the core module (scala) label Apr 30, 2026
Starting with plugin 25.06, the RAPIDS plugin auto-tunes the number of
concurrent GPU tasks based on memory usage (NVIDIA/spark-rapids#12374), so
the AutoTuner should stop recommending `spark.rapids.sql.concurrentGpuTasks`
for apps using that plugin version or later. Target cluster `enforced` and
`preserve` overrides still take precedence.

Fixes NVIDIA#2089

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa force-pushed the rapids-tools-2089 branch from 03ebd43 to 1286dcc Compare April 30, 2026 20:17
@parthosa parthosa marked this pull request as ready for review April 30, 2026 20:35
@greptile-apps

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR drops the AutoTuner's spark.rapids.sql.concurrentGpuTasks recommendation for apps running plugin version >= 25.06.0, since that version auto-tunes the value at runtime. It also improves ToolUtils.compareVersions from returning a silent 0 on parse failure to returning Option[Int], and all callers are updated correctly.

Confidence Score: 5/5

Safe to merge — logic is correct, all callers of the updated API are accounted for, and all edge cases are covered by tests.

No P0 or P1 issues found. The compareVersions API change is a strict improvement and all callers are updated. The drop logic integrates cleanly with skippedRecommendations, ignoreRecommendation, and the initRecommendations / calculateClusterLevelRecommendations ordering. Five new tests provide thorough coverage.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala | Adds getRapidsPluginJarVersion / isConcurrentGpuTasksAutoTunedByPlugin helpers and drops the concurrentGpuTasks recommendation for plugin >= 25.06.0; integrates correctly with skippedRecommendations, ignoreRecommendation, comment suppression, and the initRecommendations / calculateClusterLevelRecommendations ordering. |
| core/src/main/scala/org/apache/spark/sql/rapids/tool/ToolUtils.scala | Changed compareVersions to return Option[Int] instead of Int, converting a silent failure (returning 0) to an explicit None; all three callers updated correctly. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/Platform.scala | Adds isPropertyUserOverridden helper that correctly combines getUserEnforcedSparkProperty and isPropertyPreserved checks; placed alongside its sibling methods. |
| core/src/test/scala/com/nvidia/spark/rapids/tool/tuning/ProfilingAutoTunerSuite.scala | Adds 5 focused tests via a shared runConcurrentGpuTasksScenario helper covering: drop for >= 25.06, keep for < 25.06, keep with no jar, enforced override, and preserve override. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[calculateClusterLevelRecommendations] --> B{isPropertyUserOverridden\nconcurrentGpuTasks?}
    B -- enforced --> C[appendRecommendation\nenforced value via initRecommendations]
    B -- preserved --> D[appendRecommendation\ncalculated value]
    B -- neither --> E{isConcurrentGpuTasksAutoTunedByPlugin?}
    E --> F[getRapidsPluginJarVersion\nfromRapidsJars + pluginJarRegEx]
    F --> G{distinct versions found?}
    G -- exactly one --> H[compareVersions\njarVer vs 25.06.0]
    G -- zero or 2+ --> I[None → false]
    H --> J{>= 0?}
    J -- yes --> K[skippedRecommendations +=\nconcurrentGpuTasks\ndrops rec + missing comment]
    J -- no or None --> L[appendRecommendation\ncalcGpuConcTasks]
    I --> L

Reviews (2): Last reviewed commit: "Address review: option-typed compareVers..."

- Make `ToolUtils.compareVersions` return `Option[Int]` so a parse failure
  cannot masquerade as version equality. Update the two existing callers
  to handle `None` explicitly.
- Add a `ProfilingAutoTunerSuite` test covering the `preserve` branch of
  `Platform.isPropertyUserOverridden` for the concurrentGpuTasks drop logic.

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
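
To illustrate why an `Option[Int]` result is safer than a silent `0`, here is a minimal sketch of an option-typed comparison. It is not the `ToolUtils.compareVersions` implementation: the object name and the parsing rules (dot-separated integers, missing components treated as 0) are assumptions.

```scala
import scala.util.Try

// Hypothetical sketch; the real ToolUtils.compareVersions may parse versions differently.
object VersionCompareSketch {
  /** Parses "25.06.0" into Seq(25, 6, 0); None if any component is not an integer. */
  private def parse(version: String): Option[Seq[Int]] =
    Try(version.split('.').toSeq.map(_.toInt)).toOption

  /** Some(negative | 0 | positive) when both versions parse, None otherwise. */
  def compareVersions(a: String, b: String): Option[Int] =
    for {
      pa <- parse(a)
      pb <- parse(b)
    } yield {
      // Pad the shorter version with zeros, then take the first differing component.
      val firstDiff = pa.zipAll(pb, 0, 0).collectFirst {
        case (x, y) if x != y => Integer.compare(x, y)
      }
      firstDiff.getOrElse(0)
    }
}
```

A caller can then write `compareVersions(jarVer, "25.06.0").exists(_ >= 0)`, so an unparseable version yields `false` instead of being mistaken for equality.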

Labels

autotuner core_tools Scope the core module (scala)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Drop AutoTuner recommendation for concurrentGpuTasks for apps using plugin >= 25.06

2 participants